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Abstract 

Maximum likelihood estimation (MLE) is a fundamental computational problem in statistics. The 
problem is to maximize the likelihood function with respect to given data on a statistical model. An 
algebraic approach to this problem is to solve a very structured parameterized polynomial system 
called likelihood equations. For general choices of data, the number of complex solutions to the 
likelihood equations is finite and called the ML-degree of the model. 

The only solutions to the likelihood equations that are statistically meaningful are the real/positive 
solutions. However, the number of real/positive solutions is not characterized by the ML-degree. 
We use discriminants to classify data according to the number of real /positive solutions of the like¬ 
lihood equations. We call these discriminants data-discriminants (DD). We develop a probabilistic 
algorithm for computing DDs. Experimental results show that, for the benchmarks we have tried, 
the probabilistic algorithm is more efficient than the standard elimination algorithm. Based on the 
computational results, we discuss the real root classification problem for the 3 by 3 symmetric ma¬ 
trix model. 


1 Introduction 

We begin the introduction with an illustrative example. Suppose we have a weighted four-sided die 
such that the probability p, of observing side i (i = 0,1,2,3) of the die satisfies the constraint po + 2p\ + 
3p 2 — 4^3 = 0. We toss the die 1000 times and record a 4-dimensional data vector {uq,h\,U 2, 113), where 
Uj is the number of times we observe the side i. We want to determine the probability distribution 
(po, pi, pi, Pi ) G ]R > 0 that best explains the data subject to the constraint. One approach is by maximum 
likelihood estimation (MLE): 
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Maximize the likelihood function Po°pTP2 2 P3 3 subjected to 

Po + 2pi + 3p 2 - 4p 3 = 0, Po + Pl + P 2 + P3 = L 
p o > 0, pi > 0, p 2 > 0, and p 3 > 0. 


For some statistical models, the MLE problem can be solved by well known hill climbing algorithms 
such as the EM-algorithm. However, the hill climbing method can fail if there is more than one local 
maximum. Fortunately, it is known that the MLE problem can be solved by solving the system of 
likelihood equations fl5ll2l: 

Fo = P0A1 + P0A2 — «o F3 = P3A1 — 4P3A2 — U3 

h = P 1 A 1 + 2piA 2 - Mi F 4 = Po + 2pi + 3p 2 - 4 p 3 

F 2 = P 2 A 1 + 3/32^2 — M 2 F 5 = po + Pi + P 2 + P3 — 1 

where Ai and A2 are newly introduced indeterminates (Lagrange multipliers) for formulating the like¬ 
lihood equations. More specifically, for given («o. Mi, M2, M3), if (po. Pi, P2, P3) is a critical point of the 
likelihood function, then there exist complex numbers Ai and A2 such that (po. Pi, P2, P3, Ai, A2) is a so¬ 
lution of the polynomial system. For randomly chosen data zq, the likelihood equations have 3 complex 
solutions. However, only solutions with positive coordinates p, are statistically meaningful. A solution 
with all positive p, coordinates is said to be a positive solution. So an important problem is real root 
classification (RRC): 

For which zq, the polynomial system has 1,2 and 3 real/positive solutions? 

According to the theory of computational (real) algebraic geometry 126 . 201. the number of (real/positive) 
solutions only changes when the data zq goes across some "special" values (see Theorem |2|). The set 
of "special" zq is a (projective) variety (see Lemma 4 in |2Q| ) in (3 dimensional complex projective space) 
4-dimensional complex space. The number of real/positive solutions is uniform over each open con¬ 
nected component determined by the variety. In other words, the "special" iq- plays the similar role as 
the discriminant for univariate polynomials. The first step of RRC is calculating the "special" iq, leading 
to the discriminant problem: 

How to effectively compute the "special" zq? 

Geometrically, the "special" zq is a projection of a variety. So in principle, it can be computed by 
elimination ( see Chapter 3, page 115-128 in ( 6 j). For instance, by the command eliminate in Macaulay 2 
IflOl . we compute that the "special" zq in the illustrative example form a hypersurface defined by a 
homogenous polynomial in (uo. Mi, M 2 , M 3 ) (see Example 1). However, for most MLE problems, due to 
the large size of likelihood equations, the elimination computation is too expensive. In this paper, we 
discuss the "discriminant" problem for the likelihood equations. The contributions of the paper are 
listed as follows. 

• For likelihood equations, we show that the "special" zq form a projective variety. We call the 
homogenous polynomial that generates the codimension 1 component of the projective variety the data- 
discriminant. This name distinguishes it from the weight-discriminant for the likelihood equations (which 
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replaces the condition po + • • • + p n = 1 with the condition hopo + • • • + h n p n = 1 with parameters 
ho, • • •, h n ). 

• For algebraic statistical models, we develop a probabilistic algorithm to compute data-discriminants. 
We implement the algorithm in Macaulay 2 . Experimental results show that the probabilistic algorithm 
is more efficient than the standard elimination algorithm. 

• We discuss the real root classification for the 3x3 symmetric matrix model, which inspire future 
work. 

We remark that our work can be viewed as following the numerous efforts in applying computa¬ 
tional algebraic geometry to tackle MLE and critical points problems l!5l [2- 1. 16. 2r >. 12.i8l fl3l[18 ,14, 21 ]. 

The paper is organized as follows. The formal definition of the data-discriminant is introduced 
in Section [2] The standard elimination algorithm and the probabilistic algorithm are presented in 
Section [3j Experimental results comparing the two algorithms are shown in Section [4j The real root 
classification of the 3x3 symmetric matrix model and conclusion are given in Section 0 


2 Definition 

In this section, we discuss how to define "data-discriminant". We assume the readers are familiar with 
elimination theory (see Chapter 3 in l 6 ll). 

Notation 1. Let P denote the projective closure of the complex numbers C. For homogeneous polynomials 
g\, ■■■rgs in Q[po, ■ • •, Pn], v(gi,... ,gs) denotes the projective variety in P ' 1 defined by g lr ... ,g s . Let A„ 
denote the n-dimensional probability simplex {(po, ..., p n ) G K” +1 |po > 0 ,..., p n > 0 , po +-b p n = 1 }- 

Definition 1. JfT5f (Algebraic Statistical Model and Model Invariant) The set X is said to be an algebraic 

statistical model if X = V{g\,. .. ,g s ) IT A„ where g\,...,g s define an irreducible generically reduced projective 
variety. Each g^ (1 < k < s) is said to be a model invariant of X. 

For a given algebraic statistical model, there are several different ways to formulate the likelihood 
equations ffl5fl . In this section, we introduce the Lagrange likelihood equations and define the data- 
discriminant for this formulation. One can similarly define data-discriminants for other formulations 
of the likelihood equations. 

Notation 2 . For any f\,...,f m in the polynomial ring Q[*i, V«(/i,... ,f m ) denotes the affine vari¬ 

ety in C k defined by f\,... ,f m and (/i,.. denotes the ideal generated by f\,... ,f m . For an ideal I in 
Q[xi,..., x/ c ], Vn(f) denotes the affine variety defined by I. 

Definition 2. fT3l/ (Lagrange Likelihood Equations and Correspondence) Given an algebraic statistical 
model X. The system of polynomial equations below is said to be the Lagrange likelihood equations of X: 

Fo = po(Ai + ^-A 2 + • ■ • + -^Ag+i) — «o = 0 
dp 0 dp 0 


Fn= Pn( A! + ^A 2 

dPn 


+ A s +l) “ Un — 0 

dp n 
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F n +1 = gl{PO, i) = 0 

Fn+s = gs (PO, ■ ■ ■ / Pn) = 0 
fn+s+1 = PO + ' ' ' + Pi! — 1 = 0 

where g\,...,g s are the model invariants of X and uq,. .., u n , po,..., p n , Ai,..., A s+ i are indeterminates (also 
denoted by u, p, A). More specifically, 

- po, ■ ■ ., p n , Ai,..., A s+ i are unknowns, 

- uq, ..., u n are parameters. 

V a (F 0 , - - -, Fn+s+i), namely the set 

{(u,p,A) G C n+1 x C” +1 x C s+1 \F 0 = 0,...,F n+S+1 = 0}, 

is said to be the Lagrange likelihood correspondence of X and denoted by Cx- 

Notation 3. Let n denote the canonical projection from the ambient space of the Lagrange likelihood correspon¬ 
dence to the C ,!+1 associated to the u indeterminants n: C n+1 x C" +s+2 —> C" +1 . 

Given an algebraic statistical model X and a data vector u G R" 0 , the maximum likelihood estimation 
(MLE) problem is to maximize the likelihood function pf 1 ■ ■ ■ pf subject to X. The MLE problem can be 
solved by computing 7r _1 (u) IT Cx- More specifically, if p is a regular point of V(g\,... ,g s ), then p is 
a critical point of the likelihood function if and only if there exist A G C s+1 such that (u,p. A) G Ax- 
Theorem 1 states that for a general data vector u, n 1 (u) IT Cx is a finite set and the cardinality of 
7 T 1 (u) IT Cx is constant over a dense Zariski open set, which inspires the definition of ML-degree. For 
details, see Ifl5l . 

Theorem 1. HT5]I For an algebraic statistical model X, there exist an affine variety V C C n+1 and a non-negative 
integer N such that for any u g c ,!+1 \y, 

#7T _1 (u) IT Cx = N. 

Definition 3. Jl5j(ML-Degree) For an algebraic statistical model X, the non-negative integer N stated in 
Theorem [T| is said to be the ML-degree of X. 

Notation 4. For any S in C” +1 , Z(S) denotes the ideal 

{D G Q[u]| D(a 0 ,...,a n ) = 0 ,V(a 0 ,... ,a n ) G S}. 

S denotes the affine closure of S in C” +1 , namely V a (Z(S)). 

Definition 4. For an algebraic statistical model X, suppose Fo,..., F„ +s+1 are defined as in Definition |2] Let J 
denote 
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Then, we have the following: 

• Cxoo denotes the set of non-properness of n, i.e., the set of the u G 7t(£x) such that there does not exist 
a compact neighborhood U of u where 7T~ 1 (U) ft Cx is compact; 

• C X j denotes n(C x n V a (J)); 

• C Xp denotes tc{C x n V fl (njL 0 pjt)). 

The geometric meaning of Cx v and Cxj are as follows. The first, Cx v , is the projection of the inter¬ 
section of the Lagrange likelihood correspondence with the coordinate hyperplanes. The second, Cxj, 
is the projection of the intersection of the Lagrange likelihood correspondence with the hypersurface 
defined by /. Geometrically, Cxj is the closure of the union of the projection of the singular locus of Cx 
and the set of critical values of the restriction of n to the regular locus of Cx (see Definition 2 in |[20l ). 

The Lagrange likelihood equations define an affine variety. As we continuously deform the param¬ 
eters Ui, coordinates of a solution can tend to infinity. Geometrically, Cxoo is the set of the data u such 
that the Lagrange likelihood equations have some solution (p. A) at infinity; this is the closure of the 
set of "non-properness" as defined in the page 1, fl9l and page 3, If23l . It is known that the set of 
non-properness of n is closed and can be computed by Grobner bases (see Lemma 2 and Theorem 2 in 

1231). 

The ML-degree encaptures geometry of the likelihood equations over the complex numbers. How¬ 
ever, statistically meaningful solutions occur over real numbers. Below, Theorem[2| states that Cx oo, Cxj 
and Cxp define open connected components such that the number of real/positive solutions is uniform 
over each open connected component. Theorem 2 is a corollary of Ehresmann’s theorem for which there 
exists semi-algebraic statements since 1992 |5). 

Theorem 2. Tor any algebraic statistical model X, 

• if Ci,... ,Ct are the open connected components of 

R'' +1 \(£xoo U Cxj), 


then for each k (1 < k < t),for any u e C k , 

#/r“ 1 (u)n£xnR” +s+2 


is a constant; 

• if C\,... ,Ct are the open connected components of 


R' !+1 \(£xoo U C X j U C X p), 


then for each k (1 < k < t), for any u G C k , 

#tt^ 1 (u) n C x n (R^g 1 x R s+1 ) 


is a constant. 
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Before we give the definition of data-discriminant, we study the structures of Cxoo, C-xj, and C Xp 
below. 

• Proposition Q] shows that the structure of the likelihood equations forces C Xp to be contained in 
the union of coordinate hyperplanes defined by njt=o u k- 

• Proposition |2| shows that the structure of the likelihood equations forces £x/ \ {0 } to be a projective 
variety. 

• Similarly as the proof of Proposition |2 we can also show that the structure of the likelihood 
equations forces £xoo\{ 0 } to be a projective variety. 

Proposition 1. For any algebraic statistical model X, 


c Xp c v fl (n n k=0 u k ). 

Proof. By Definition^ for any k (0 < k < n ), 

u k = Pk{M + g ~ A2 A - f a^ As+1 ) “ Fk ' 


Hence, 

Uk e {h, Vk) n Q[u k ] c (F 0 / ..., F n+S+ 1 , p k ) n c[u] 


So 


Va((Fo,... / Fn-\-s+l, Pk) nc[u]) c v a (u k ) 


By the Closure Theorem |6fl . 


V„((F 0/ ... / F;; + s + l/ Pk) nc[u]) = n{C X C\ Va(pk)) 


Therefore, 


c Xp = tt(c x n v fl (n^ =0 pjt)) 

= n(c x nuz =0 v a (p k )) 

= ULo 7 r(£ x nv fl (p fc )) 

C U k=0 V a (u k ) 

= v„(n^ = 0 u fc ).n 

Remark 1 . Generally, C Xp 7 ^ V fl (lT p=0 u k ). For example, suppose the algebraic statistical model is V„(po — 
pi) n Ai. Then C Xp = 0/ V fl (u 0 wi). 

Notation 5. V Xp denotes the product IT k=0 u k . 

Proposition 2. For an algebraic statistical model X, we have £xj\{0} is a projective variety in P", zvhere 0 is 
the zero vector (0,... ,0) in C' I+1 . 
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Proof. By the formulation of the Lagrange likelihood equations, we can prove that Z(tt{£x Pi V a (j)) 
is a homogeneous ideal by the two basic facts below, which can be proved by Definition [2| and basic 
algebraic geometry arguments. 

Cl. For every u in tc(Cx H V n (J)), each scalar multiple au is also in tc(Cx Pi V n (J)). 

C2. For any S C C n+1 , if for any u G S and for any scalar a G C, au G S, then Z(S) is a homogeneous 
ideal in Q [u]. 

That means the ideal Z{n{Cx Pi V„ (/)) is generated by finitely many homogeneous polynomials D\, 

D m . Therefore, £ XJ = V a (Z(n(C x (T V«(J))) = V a (D lf ..., D m ). So £ X /\{0} = V(Dj,..., D m ) C P". 

□ 

Notation 6 . For an algebraic statistical model X, we define the notation T>xj according to the codimension of 
£x/\{0} in P". 

• If the codimension is 1, then assume V(Di),..., V(D X ) are the codimension 1 irreducible components in 
the minimal irreducible decomposition of £xj\{ 0} in P" and (Di), ..., (Dk) are radical. T>xj denotes the 
homogeneous polynomial IT f =1 Dj. 

• If the codimension is greater than 1, then our convention is to take T>xj = 1. 

Similarly, we use the notation T>xoo to denote the projective variety £x/\{0}- Now we define the 
"data-discriminant" of Lagrange likelihood equations. 

Definition 5. (Data-Discriminant) For a given algebraic statistics model X, the homogeneous polynomial 
Z^xco ■ Vxj ■ Vx v is said to be the data-discriminant (DD) of Lagrange likelihood equations of X and denoted 
byV x . 

Remark 2. Note that DD can be viewed as a generalization of the "discriminant'' for univariate polynomials. 

So it is interesting to compare DD with border polynomial (BP) If26il and discriminant variety (DV) |20|. DV 
and BP are defined for general parametric polynomial systems. DD is defined for the likelihood equations but can 
be generalized to any square and generic zero-dimensional system. Generally, for any square and generic zero¬ 
dimensional system, V a {DD) C DV C V a (BP). Note that due to the special structure of likelihood equations, 

DD is a homogenous polynomial despite being an affine system of equations. However, generally, DV is not a 
projective variety and BP is not homogenous. 

Example 1 (Linear Model). The algebraic statistic model for the four sided die story in Section [l] is 
given by 

X = V(p 0 + 2pi + 3 p 2 - 4 p 3 ) Pi A 3 . 

The Langrange likelihood equations are the Fq = 0,..., F 5 = 0 shown in Section |T] The Langrange 
likelihood correspondence is Cx = V„(Fo,... ,Fs) C C 10 . If we choose generic (uo, U\,U 2 ,uf) G C 4 , 

7T — 4 (mo/ «i, « 2 , M 3 ) PI £x = 3, namely the ML-degree is 3. The data-discriminant is the product of T>xco, 

T>xp and V X j, where 

F’xoo = Mo + Mi + u 2 + m 3 , V Xp = m 0 mim 2 m 3 , and 

Dxj = 441 Ug + 4998i(gi(i + 20041 Ugii^ + 33320i(o !( i + 19600uj — 756ugi(2 + 20034;ig«i;i2 + 83370iigu^H2 + 79800k^W 2 — 5346/<§z<2 + 
55890uoiii»| + 119025j(ji(| +4860 »oh| + 76950i/i»| + 18225j;| — 1596i(gi/3 — 11116iigi/ii(3 — 17808ugu^u3 + 4480iijt(3 + 7452»gU2“3 — 7752ugUiU2U3 + 
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49680uji<2»3 — 17172 j(o;( 2 !, 3 + 71460»iw|i(3 + 27540 i<2 

»3 + 2116ugu| + 6624 K 0 K 1 U 3 — 4224//^— 9528(/o»2K 3 + 15264((i// 2(/§ + 14724((2i(3 — I 2 I 6 U 0 K 3 — 512i/i 1/3 + 3264((2Ug + 256i(g. 

By applying the well known partial cylindrical algebraic decomposition (PCAD) (U method to the 
data-discriminant above, we get that for any (uq, U\, 112 , 113 ) E IR 4 , 0 , 

• if V XJ (u 0 , Ui, u 2 , M 3 ) > 0, then the system of likelihood equations has 3 distinct real solutions and 
1 of them is positive; 

• if T>xj{uo, Ui, 11 2 , M 3 ) < 0 , then the system of likelihood equations has exactly 1 real solution and it 
is positive. 

The answer above can be verified by the RealRootClassif ication |26l [3l command in Maple 17. In 
this example, the Vxoo does not effect the number of real/positive solutions since it is always positive 
when each Uj is positive. However, generally, 'D Xco plays an important role in real root classification. 
Also remark that the real root classification is equivalent to the positive root classification for this 
example but it is not true generally (see Example 6 ). 

3 Algorithm 

In this section, we discuss how to compute T> x ■ We assume that X is a given statistical model, 
Fo ,..., F n + S +1 ar e defined as in Definition^ and / is defined as in DefinitionlU We rename Fq, ..., F n+s+ i 
as Fo,... ,F m . Subsection 13.11 presents the standard elimination algorithm for reference and Subsection 
13.21 presents our main algorithm (Algorithm |2|). 

3.1 Standard Elimination Algorithm 

Considering the data-discriminant as a projection drives a natural algorithm to compute it. This is the 
standard elimination algorithm in symbolic computation: 

• we compute the C x / by elimination and then get V x j by the radical equidimensional decomposition 
(see Definition 3 in f20l~). The algorithm is formally described in the Algorithm [0 


Algorithm 1: DX-J 

input : F 0 ,...,F m ,J 
output: T>xj 

1 5 u(- the generator polynomial set of the elimination ideal (Fo,... F m , J) fl Q[u] 

2 T>xj t— the codimension 1 component of the equidimensional radical decomposition of (Q n ) 

3 return V X j 


• we compute C Xca by the Algorithm PROPERNESSDEFECTS presented in lf20H and then get 'D Xoo by 
the radical equidimensional decomposition. We omit the formal description of the algorithm. 

The previous algorithms in this subsection can not be used to compute DDs of algebraic statistical 
models in a reasonable time, see Tables 1-2 in Section [4j This motivates the exploration of a more 
practical method found in the next subsection. 
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3.2 Probabilistic Algorithm 

First, we prepare the lemmas, then we present the main algorithm (Algorithm [2]l. 

• Lemma 1 is used to linearly transform parameter space. 

• Corollary 1 and Lemma 2 are used to compute the totally degree of T> X j- 

• Corollary 2 is used in the sampling for interpolation. 

Lemma 1. For any G G Q[u], there exists an affine variety V in C" such that for any (a\,... ,a n ) G C”\V, the 
total degree of G equals the degree ofB(to, t\,..., t n ) zv.r.t. to to, where 

B(to, t\,...,t n ) = G(fo,flifo + t\,... ,a n to + t n ) 

Proof. Suppose the total degree of G is d and G t j is the homogeneous component of G with total 
degree d. For any (l,a lr ... ,a n ) G C n+ 1 \V fl (G d ), let B(t 0 , t \,..., t n ) = G(t 0 , flifo + h,...,a n t 0 + t n ). It is 
easily seen that the degree of B w.r.t. to equals d. □ 

Corollary 1. For any G G Q[u], there exists an affine variety V in C 2n+2 such that for any 

Oo, b 0 ,..., an, b n ) G C 2 n+ 2 \ V, 
the total degree of G equals the degree ofB(t ) where 

B[t) = G(aot + bo ,..., a n t + b n f 

Lemma 2. There exists an affine variety V in C Zll+2 such that for any ( ao, bo,..., a n , b n ) G C 2 n+ 2 \y, if 

(A(t)} = (Fo(t),... ,F„(t),F n+ i,... ,F m ,J ) nQ[f] 
where Fft ) is the polynomial by replacing u l with ajt + b l in Fj (i = 0,..., n) and 

B(t) = T>xj(aot + bo,...,a n t + b n ), 


then ( B(t )) = y/(A(t)). 

Proof. By the definition of T>xj (Notation [6]), there exists an affine variety V\ such that for any 
( ao,bo,■ ■ ■,a n ,b n ) G C 2 ” +2 \Vi, ( B(t )) is radical. Thus, we only need to show that there exists an affine 
variety V 2 in C 2 ” +2 such that for any (a 0 , b 0 , ■.. ,a n , b n ) G C 2n+2 \V 2 , V„((B(t))) = V ll ((A(t))). 

Suppose 7Tf is the canonical projection: C x C"' +1 —> C. For any 

t* G 7Z t (V a (Fo(t),... ,F n (t),F n+ i,... ,F m ,j)), 

let u* = ad* + bj (for i = 0,..., n), then (t/jj,..., u*) G 7t(£ x H Hence V X j(uq, • • •, u*) = 0 and 

so B(t*) = 0. Thus 

B(t) G l(n t (V„(Fo(t),... ,F n (t),F n+ i, .. .,F m ,J)). 
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Therefore, 


V a {A{t)) = V a (l(n t (V a (Fo(t),...,F n (t),F n+1 ,...,F m ,J))) 

C V fl (B(f)). 

For any t* G V fl ((B(f))), let u* = djt* + bj for i = 0 then (wg,. ..,«*) G V„(Dx/) C /lx/- By the 
Extension Theorem Q, there exists an affine variety V 2 C C * 1 2 3 4 5 6 7 " 1 2 such that if (fly, bo, ..., a n , b n ) f V 2 , 
then (m£,. ..,«*) G /r(£ x n V„(/)), thus 

f* e 7Tf(V fl (Fo(f ) / ... ,F n (t),F n+ i,.. .,F m ,/)) C V«(A(t)).n 

Corollary 2. T/tere exists an affine variety V in C" such that for any (a\,... ,a n ) G C”\V, if 

(A(« 0 )> = <fb / F 1 *... / F„* / F w+ i / ... / F M/ /)nQ[«o] 

w/tere F* is i/rc polynomial by replacing ip with ai in Fj (i = 1,..., n) and 

B(u 0 ) = V XJ (uo,ai,...,a n ), 


then (B(uq)) = y/{A(u 0 )}. 


Algorithm 2: (Main Algorithm) InterpolationDX-J 


input : F 0 ,...,F m ,J 
output: T>xj 

1 «i,..., a n F- LinearOperator(F 0 ,..., F m , ]) 

2 for i from 1 to n do 

3 Ft i— replace ii; in F; with a,«o + u i 

4 NewSys -t-ft, F{ ..., F(f F„+i,..., F m , J 

5 d, dg,... ,d n <— F>egree(NewSys) 

6 for j from I to d do 

7 Rename all the monomials of the set 

{uf ■ ■ ■ u * n |«i + - /,0 < a, < rf/} 

tfy,i/ ■ • ■ / 
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10 

11 

12 


N -s— max(Ni,..., Nj) 
for k from 1 to IV do 

t’t.i, • • •, h,n f— random integers 
A(uo) ■f-Intersect(NeicSys,bj :; i,..., b kn ) 

C* ik ,... ,C kk -s— the coefficients of A(uq) w.r.t u®, 


13 

14 

15 


for j from 1 to d do 

Mj <— Nj x Nj matrix whose ( k, r)-entry is Uj jr (b k/k ,b ki „) 
Cj<~(U,, .. U jiN .)Mr\Cf .C* N .) r 


16 Dxj <— replace u k ,..., u n in »g + Y.f = gCd-iU l 0 with u k — a k ■ tig,...,u n — a n ■ iig 

17 Return T>xj 
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We show an example to explain the basic idea of the probabilistic algorithm and how the lemmas 
work in the algorithm. 

Example 2 (Toy Example for Interpolation Idea). Suppose the radical of the elimination ideal (F, ]) fl 
<Q[u] is generated by D(uo, “l, “ 2 )/ where F = uop 2 + U\p + 112 and / = 2uop + U\. We already know 
that D is homogenous and equals u\ — 4 mo« 2 - Rather than by the standard elimination algorithm, we 
compute D by the steps below. 

• First, we substitute u$ = t + 11, u\ = 3t + 2 and U 2 = 5t + 6 into F and / (the integers 1,11,3,2,5 
and 6 are randomly chosen). We compute the radical of the elimination ideal ( F(t,p),](t,p)) HQ[f] 
and get (lit 2 + 232 1 + 260). By Lemma[2l D(t + 11, 3t + 2,5f + 6) = lit 2 + 232 1 + 260. By Corollary!]} 
the total degree of D is 2 (it geometrically means the random line Uq = t + 11, u\ = 3t + 2, it 2 = 5t + 6 
intersect our desired hypersurface at 2 points in the parameter space and it is exactly the definition of 
the degree of hypersurface). Similarly, we compute the degree of D w.r.t Uq,U\ and 112 and get 1,2 and 
1, respectively. So all the possible monomials in D are it 2 , UqU\, u\ 112 and «o u 2 - 

• Assume D = u 2 + ( C\Uq + C 2 U 2 )u\ + C 3 M 0 M 2 . We first substitute uo = 13 and 1/2 = 4 into F and 
/. We compute the radical of the elimination ideal (F(ui,p),J(ui,p)) DQ[mi] and get (u 2 — 208). By 
Corollary^ D(13, «i,4) equals u\ — 208. Hence, 13Ci + 4 C 2 = 0 and 52 C 3 = —208. Therefore, C 3 = —4. 
We need one more evaluation to solve Ci and C 2 . So we substitute uq = 7 and 112 = 3 into F and /. 
Similarly, we get 7C\ + 3 C 2 = 0 and thus C\ = C 2 = 0. Therefore, D = u 2 — 4 uqU 2 (the integers 13,4,7,3 
are randomly chosen). 

This example is "nice". Because the degree of D w.r.t u-\ equals the total degree of D. In general 
case, if there is no iq such that the degree of D w.r.t u t equals the total degree, then we should apply the 
linear transformation to change the parameter coordinates before interpolation. Lemma [l] guarantees 
the linear transformation makes sense. 


Algorithm 3: Intersect 

input : F 0 ,..., F m , J and integers foi,. .., b n 
output: V X ;(uo,bi,... / b„) 

1 for i from 1 to n do 

2 |_ h* replace u ; in F*. with fc; 

3 A(ug) the generator of the elimination ideal (Fo,Fj*,..., F*, F„ + i,. .. F m , /) nQ[« 0 ] 

4 A(ug) <— the monic generator of + (A(ug)) 

5 return A(ug) 


Now we are prepared to introduce the probabilistic algorithm for computing the T>xj- We explain 
the main algorithm (Algorithm [2]) and all the sub-algorithms (Algorithms 4-6) below. 

Algorithm[5](Degree). The probabilistic algorithm terminates correctly by Corollary |l] and Lemmal2j 
Algorithm 0] (LinearOperator). The probabilistic algorithm terminates correctly by Lemma [0 
Algorithm!!](Intersect). The probabilistic algorithm terminates correctly by Corollary [2j 
Algorithm [2] (InterpolationDX-J). 

Lines 1-5. We compute the total degree of T>xj and the degrees of T>xj w.r.t Uq, ..., u^'. d,do,..., d n 
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Algorithm 4: LinearOperator 


1 

2 

3 

4 

5 

6 

7 

8 

9 

10 

11 


input : Fq,.. F m , J 

output: a±, ... ,a„ such that the total degree of T>xj equals the degree of T>x]{ug,ai ■ ug + u \,. . ■ ug + u n ) w.r.t u g 

d,dg,...,d n <—Degree(Fo/• • ■,F„„ J) 

if d = dg then 

return 0 ,..., 0 


else 


repeat 

for i from 1 to n do 

a random integer 

Fj <— replace u,- in F; with fl; ■ ug + i/,- 

NewSys <-F 0 ,F[..., F',F„ +1/ ...,F m ,] 
d,dg,... ,d n t—Degree(NewSys) 

until d = dg 


12 return a\,... ,a n 


by Algorithm |5] Algorithm |4| guarantees that do = d by applying a proper linear transformation 
ll\ — U\ • U-Q -(- U\f . . . , llyi — Clyi ' t/o H“ 

Lines 6-7. Suppose T>xj = + Quq -1 + ... + Q_it/o + Q where C\,..., Q G Q[ih/ • • •, w«] and 

the total degree of C ; is /. For j = 1,... ,n, we estimate all the possible monomials of Cj by computing 
the set 

{< • ■ • w“ n |cti + ... + a. n = j,0 < a-i < di} 

Assume the cardinality of the set is Nj and rename these monomials as Uj t , Uj^.. Then we assume 

Cj = Cj'iUj'i + ... + 

where Cj r \, ..., Cj i ^ i G Q. The rest of the algorithm is to compute Cp,..Cjp r 

Lines 8-12. For each j, for k = 1 , ..., Nj, for a random integer vector b ^ = (b/ c ,i, ..., we compute 
V X j(u 0 ,bk) by Algorithm |3] That means to compute the function value C/(b/J without knowing T>xj- 
Lines 13-15. For each j, we solve a square linear equation system for the unknowns c,-j,..., Cjp.\ 

i(bfc) + ... T (b^) = 

(k = l,...,Nj) 

It is known that we can choose nice bfc probabilistically such that the coefficient matrix of the linear 
equation system is non-singular. 

Lines 16. We apply the inverse linear transformation in the parameter space to get the T>xj f° r the 
original F 0 ,...,F m . 

We can also apply the interpolation idea to Algorithm PROPERNESSDEFECTS i20l and get a probabilis¬ 
tic algorithm to compute the T>x oo- We omit the formal description of the algorithm. 

Remark 3. According to the Notation\6\ when the codimension of £xj\{ 0} 6Cxoo\{0}) is greater than 1, zve 
define T>xj CDxco) is 1. Therefore, it is no more true that the number of real/positive solutions still remains 
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Algorithm 5: Degree 

input : F 0r ...,F m ,J 

output: d,dg,... d n , where d is the total degree of Vxj and d; is the degree of T>x] w.r.t each u; (i = 0, ... ,n) 

1 for i from 0 to n do 

2 F 0 *,...,F* <- replace uo, ■ ■ ■, Ui+i, ■ ■ ■, u n in Fg, ■ ■ ■, F„ with random integers 

3 A(ui) 4— the generator of the elimination ideal (Fg ,..., Lz , F„ + l/ • • ■ Fmr J) FI Q[Uj] 

4 A(uj) <— the generator of yj(A(iij)) 

5 dj <— degree of A(u{) 

6 a,-, bj «— random integers 

7 Fg(t),... ,F n (t) <— replace tig,... ,u n with cig ■ f + bg, ... ,a„ ■ t + b n in Fg,... ,F n 

8 A(t) the generator of the elimination ideal (Fg(t),... ,F„(t),F n+ i,... F m ,J) n Q[f] 

9 A(t) i— the generator of y/ ( A(t )} 

10 d •<— degree of A(t) 

11 return d, dg,..., d n 


constant over the region determined by the data-discriminant. That means if the output of the Algorithm |2] is 1, 
we should use the standard method (elimination or computing Grobner base). 

4 Experimental Timings 

We have implemented the probabilistic algorithm in Ma-caulay2. We have also implemented the stan¬ 
dard algorithm in Macaulay2 to do comparisons (Tables 1 and 2). Some of the necessary implementation 
details are shown below. 

• In the Algorithm HI Line 1, Algorithm [3] Line 3 and Algorithm [5] Lines 3 and 8, we use the 
Macaulay2 command eliminate to compute the elimination ideals. 

• The probabilistic algorithm is implemented in two different ways. The first implementation is to 
interpolate at once, which is exactly the same as the Algorithm [2] The second implementation is to 
interpolate step by step. For example, suppose the T>xj is a polynomial in Uq, u\, U 2 and 1/3, we first 
compute T>xj(uo, u\, uf uf) by interpolation for some chosen integers ar >d uf And then we compute 
'Dx](uq, U], u 2 , uf) by interpolation. At this time, it is easy to recover T>xj since T>xj is homogeneous. 
The algorithm is naive to describe so we omit the formal description. 

We run Algorithms Q] and |2| for many examples to set benchmarks by a 3.2 GHz Inter Core i5 
processor (8GB total memory) under OS X 10.9.3. There are two kinds of benchmarks, the random 
models and literature models. 

• We generate 2 groups of "random models". The first group of random models are generated 
as follows. We first generate a random homogenous polynomial in 3 variables po, p \ and P 2 with total 
degree 2. Suppose this homogenous polynomial is a model variant. We repeat the process for 10 rounds 
and get 10 random models. We call this group of 10 models 2 deg-models. Similarly, we generate the 
group of 3 deg-models. The Table 1 provides the timings of Algorithm |l] and Algorithm |2] (with two 
different implementations) for 2 deg-models and 3 deg-models. 

• The literature models are the examples presented in the literatures lH5ll7lll3ll. Table 2 provides the 
timings of Algorithm [l] and Algorithmic (with two different implementations) for the literature models. 
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Table 1: Timings of Computing T>xj for Random Models (s: seconds; h: hours; SI: Strategy 1; S2: 
Strategy 2) 


2 deg-models 

3deg-models 

Algorithm^ 

Algorithmic] 

Algorithm^ 

Algorithmic] 

SI 

S2 

ST 

S2 

4.9s 

0.8s 

0.6s 

>2h 

800.4s 

901.2s 

3.0s 

0.7s 

0.6s 

>2h 

777.3s 

871.5s 

5.0s 

0.8s 

0.6s 

>2h 

1428.9s 

1499.5s 

5.4s 

0.8s 

0.7s 

>2h 

1118.9s 

1192.9s 

6.3s 

0.8s 

0.7s 

>2h 

448.9s 

489.8s 

3.9s 

0.7s 

0.6s 

>2h 

1279.6s 

1346.1s 

2.0s 

0.7s 

0.5s 

>2h 

1286.5s 

1409.0s 

1.7s 

0.7s 

0.5s 

>2h 

1605.9s 

1620.9s 

3.8s 

0.8s 

0.6s 

>2h 

1099.4s 

1242.6s 

5.8s 

0.8s 

0.7s 

>2h 

1229.0s 

1288.7s 


For Examples 3-5 in the Table 2, the model invariants for these models are list below. Example 6 is 
given in Section [5711 

Example 3 (Random Censoring (Example 2.2.2 in CD)). 


2 pOPlP 2 + p\pi + Pip\ - P0P12 + P1P2P12 


Example 4 (3 x 3 Zero-Diagonal Matrix 1131 ). 


0 


det 


P 21 
. P31 


Pl2 

0 

P32 


Pl3 

P23 

0 


Example 5 (Grassmannian of 2-planes in C 4 Ifl5l[l3l ). 


P12F34 — P13P24 + P14P23 

In the Tables 1-2, the columns "Algorithm 1" give the timings of Algorithm [l] The columns "Algo¬ 
rithm 2" give the timings of Algorithm [3 where "SI" and "S2" means the first and second implementa¬ 
tions, respectively The red data means the computation has not finished and received no output. It is 
seen from the tables that 

• for all the benchmarks we have tried, the Algorithm [2] is more efficient than Algorithm [T} 

• for the random models and Example 3, the two implementations of Algorithm |2] have almost the 
same efficiency; 

• for Examples 4-6, the second implementation (interpolation step by step) of Algorithm [2] is more 
efficient than the first implementation (interpolation at once). In fact, it takes the same time for the two 
implementations to get sample points. But it takes more time for the first implementation to compute 
the inverse of Aij in Algorithm |2| Line 13, which is a large size matrix with rational entries. 

• for Example 6, with the standard elimination algorithm, our computer runs out of memory after 
12 days. 

Note that for each benchmark, the output of Algorithm [2] is the same as Algorithm [[] when both 
algorithms terminate. 
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Table 2: Timings of Computing T>xj for Literature Models (s: seconds; h: hours; d: days; SI: Strategy 
1; S2: Strategy 2) 


Models 

Algorithm^ 

Algorithm|2| 

SI 

52 

Example 3 

11.1s 

5.3s 

6.4s 

Example 4 

36446.4s 

360.2s 

56.3s 

Example 5 

>16h 

>16h 

2768.2s 

Example 6 

>12d 

>30d 

30d 


5 Conclusions and last example 

In order to classify the data according to the number of real/positive solutions of likelihood equations, 
we study the data-discriminant and develop a probabilistic algorithm to compute it. Experiments show 
that the probabilistic algorithm is more practical than the standard elimination algorithm. This is our 
first application of real root classification method on the MLE/likelihood equations problem. Our 
future work aims to 

• improve Algorithmic (note that Algorithm [2] is applying evaluation/interpolation technique to the 
standard method. It is not the first time that such an approach is investigated. In 01241, Newton-Hensel 
lifting has been applied to compute (parametric) geometric resolutions. It is hopeful that Algorithm [2] 
will be more powerful if we apply the Newton-Hensel lifting techniques to balance the time consuming 
of the evaluation and lifting steps); 

• study the data-discriminants of different formulations of likelihood equations for the same alge¬ 
braic statistical model 

• develop algorithms for computing real root classification for likelihood equations. 

More broadly, the ideas in Subsection [T2] and Algorithm |2] can be applied to compute discriminants 
when the Newton polytope is known. 

5.1 3x3 symmetric matrix model 

We end the paper with the discussion of real root classification on the 3x3 symmetric matrix model. 

Consider the following story with dice. A gambler has a coin, and two pairs of three-sided dice. 
The coin and the dice are all unfair. However, the two dice in the same pair have the same weight. He 
plays the same game 1000 rounds. In each round, he first tosses the coin. If the coin lands on side 1, 
he tosses the first pair of dice. If the coin lands on side 2, he tosses the second pair of dice. After the 
1000 rounds, he records a 3 x 3 data matrix [ujj] ( i,j = 1,2,3) where 77, ; is the the number of times for 
him to get the sides i and j with respect to the two dice. By the matrix [77,,], he is trying to estimate the 
probability p- of getting the sides i and j with respect to the two dice. 

It is easy to check that the matrix 


" Pu 

P12 

P13 

P21 

P22 

P23 

. P31 

P32 

P33 . 
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is symmetric and has at most rank 2. Let 


Pij 


| P‘j ' 1 f u _ /“«; ' “ _ 

UP/y '<)' 11 1 utj + Uji i<j 


We have an algebraic statistical model below. 

Example 6 (3 x 3 Symmetric Matrix Model). The algebraic statistical model for the dice story is 
given by 

x = v(g) n a 5/ 


where 


8 — det 


2pn p 12 Pl3 
Pl2 2pi2 P23 
Pl3 P23 2 p 33 


A5 — {(pil, • ■ ■, P33) G Ko\Pn + P12 + Pl 3 + P22 + P23 + P 33 — !}• 


The gambler's problem is equivalent to maximizing the likelihood function 


n Pi? 


(i < j) sub¬ 


jected to V(g) fl A 5 . According to the Definition 2, we present the Langrange likelihood equations 
below. 


Fo = P11A1 + (8p 2 2P33 - 2p| 3 )puA2 - Mil = 0 
Fl = p 12 Ai + (2p 13 p 2 3 — 4p 12 p33)pi2A 2 — m 12 = 0 
F2 — P13A1 + (2pi2P23 — 4p 13 p 22 )p 13 A 2 — U \3 = 0 
F3 = P22A1 + (8P11P33 — 2p3 3 )p 22 A 2 — U22 = 0 
F 4 = P 23 A 1 + ( 2 P 12 P 13 — 4pnp 2 3)p 23 A 2 — m 23 = 0 
E5 = P33A1 + (8pn P22 — 2pj 2 )p3 3 A 2 — M33 = 0 
= g(Pll, Pl2/ Pl3f P22, P23r P33) = 0 
E7 = Pit + P12 + Pl3 + P22 + P23 + F33 — 1 = 0 


where pn, P 12 , pi 3 , P 22 , P 23 , P 33 , Ai and A 2 are unknowns and Mu, M 12 , i /13 , U 22 , U 23 and 1/33 are parame¬ 
ters. 

We have 8 equations in 8 unknowns with 6 parameters and the ML-degree is 6 Ifl5l . By the Algo¬ 
rithm [2j we have computed T>xj, which has 1307 terms with total degree 12. By a similar computation, 
we get Dx J^whose last factor is exactly g(u\\, ..., U 33 ) and all the other factors are positive when each 
U{ is positive. 

For the data-discriminant T>x we have computed above, we have also computer^ at least one rational 
point (sample point) from each open connected component of T>x / 0 using RAGlib[22, T7HTT1. With 
these sample points we can solve the real root classification problem on the open cells. By testing all 
236 sample points, we see that if g(un, ■ ■ ■, U 33 ) 

7 ^ 0 , ther| 

^See T>xj and T>x 00 on the second author's website: 
sites.google.com/site/rootclassificaiton/publications/DD 

“The sample points were first successfully computed by one of the anonymous referees. 

■^This proves the real version of the RRC conjecture in the previous version of this manuscript. 


16 






- if T>xj{u\\, . .., U 33 ) > 0, then the system has 6 distinct real solutions and there can be 6 positive 
solution or 2 positive solutions; 

- if Vxj(u n , ■ ■ ■, ^ 33 ) < 0, then the system has 2 distinct real (positive) solutions. 

With 2 of these sample points, we see that the sign of T>x is not enough to classify the posi¬ 
tive solutions. For example, for the sample point (m u = l,u 12 = 1,m 13 = ^1% - } u 22 = 1,m 23 = 

34089009205592922038535 ., _ 32898355113670387769001 \ U™ I™ 7. tA/U;U. (.„■ i-U.. 

141080698675730650759168 ' “33 - 141080698675730650759168 )> tne system nas o distinct positive solutions, vvnne tor me 
sample point (u n = 1, m 12 = 1, w 13 = 199008, u 22 = 30, u 23 = 2022, h 33 = 1), the system has also 6 real solutions 
but only 2 positive solution^. 
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